K-nearest neighbour

Table of contents:

  • Introduction
  • Algorithm
  • Discussion
  • Examples

Introduction

k-Nearest Neighbour (kNN) is one of the simplest instance-based algorithms.
Instance-based learning (also called lazy learning) refers to methods that do not build a general mapping between the input and output space, but instead compute a prediction only when a new instance must be classified. Every time a new query arrives, the algorithm uses the training data, which is stored in memory, to make its prediction.
The aim of the method is to predict the output for an unseen datapoint.

Algorithm

1. For each datapoint in the training data
	1.1 Calculate the distance between the unknown point and that datapoint
2. Select the **k** datapoints from the training data most similar to the unknown point and assign the most frequent class among those **k** closest datapoints.

The algorithm consists of two main steps: the distance calculation (step 1.1) and the class assignment (step 2).

  1. Distance calculation step:
    • A distance function measures the distance between two points in the feature space (e.g. the Euclidean distance).
  2. Class assignment step:
    • For example, assigning the most frequent class in the neighbourhood of the target datapoint, as sketched below.
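
The following minimal Python sketch puts the two steps together. The function names (`euclidean_distance`, `knn_predict`) and the toy data are illustrative assumptions, not part of this post's original material:

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Step 1: the distance between two points in the feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training_data, labels, query, k=3):
    """Step 2: assign the most frequent class among the k closest datapoints."""
    # Distance from the unknown point to every element of the training set.
    distances = [(euclidean_distance(point, query), label)
                 for point, label in zip(training_data, labels)]
    # Keep the k most similar (smallest-distance) neighbours ...
    neighbours = sorted(distances, key=lambda pair: pair[0])[:k]
    # ... and take a majority vote over their classes.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

For example, with a toy two-class dataset:

```python
X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
y = ["red", "red", "blue", "blue"]
print(knn_predict(X, y, query=(1.1, 0.9), k=3))  # -> red
```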

Discussion

Advantages

  • Easy to implement (requires only the training data and a distance function)
  • Often works well in practice
  • Easy to adapt to new training data (see the snippet after this list)
  • No explicit training phase required
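
To make the last two points concrete: since kNN builds no model and merely stores the training data, adapting to new observations amounts to appending them to the stored data (continuing the illustrative sketch above):

```python
# A new labelled observation arrives; no retraining step is needed.
X.append((5.5, 5.9))
y.append("blue")
# The next call to knn_predict takes it into account automatically.
```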

Disadvantages

  • Inefficient (each prediction requires a k-nearest-neighbour query over the whole training set)
  • Does not deliver explicit knowledge about the data
  • Difficult to choose k (one common remedy is sketched below)
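
On the last point, one common (though not the only) remedy is to compare candidate values of **k** on held-out data. The sketch below assumes scikit-learn and its bundled Iris dataset purely for illustration; neither appears in the original post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```
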
You can find appendices for this algorithm here:

Examples

You can find the corresponding scripts (if applicable) here:
